DWIE: An entity-centric dataset for multi-task document-level information extraction
Abstract
This paper presents DWIE, the ‘Deutsche Welle corpus for Information Extraction’, a newly created multi-task dataset that combines four main Information Extraction (IE) annotation subtasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document. This contrasts with currently dominant mention-driven approaches that start from the detection and classification of named entity mentions in individual sentences. Further, DWIE presents two main challenges when building and evaluating IE models for it. First, the use of traditional mention-level evaluation metrics for NER and RE tasks on the entity-centric DWIE dataset can result in measurements dominated by predictions on more frequently mentioned entities. We tackle this issue by proposing a new entity-driven metric that takes into account the number of mentions that compose each of the predicted and ground truth entities. Second, the document-level multi-task annotations require the models to transfer information between entity mentions located in different parts of the document, as well as between different tasks, in a joint learning setting. To realize this, we propose to use graph-based neural message passing techniques between document-level mention spans. Our experiments show an improvement of up to 5.5 F1 percentage points when incorporating neural graph propagation into our joint model. This demonstrates DWIE’s potential to stimulate further research in graph neural networks for representation learning in multi-task IE. We make DWIE publicly available at https://github.com/klimzaporojets/DWIE.
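The first challenge in the abstract, mention-level scores being dominated by frequently mentioned entities, can be illustrated with a small sketch. The code below is not the metric defined in the paper; it is a simplified, hypothetical entity-weighted F1 in which every entity contributes equally and receives partial credit for the fraction of its mentions that are correctly typed. The function names, the toy data, and the representation of an entity as a `(type, set of mention ids)` pair are all assumptions made for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def typed_mentions(entities):
    """Flatten [(type, {mention_id, ...}), ...] into a set of (mention_id, type) pairs."""
    return {(m, t) for t, mentions in entities for m in mentions}


def mention_level_f1(gold, pred):
    """Classic mention-level F1: every mention counts once."""
    g, p = typed_mentions(gold), typed_mentions(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return f1(precision, recall)


def entity_weighted_f1(gold, pred):
    """Hypothetical entity-weighted F1: every entity counts once.

    Each entity is credited with the fraction of its mentions that also
    appear, with the same type, on the other side. Entities are assumed
    to have at least one mention.
    """
    g, p = typed_mentions(gold), typed_mentions(pred)

    def coverage(entities, reference):
        if not entities:
            return 0.0
        per_entity = [
            sum((m, t) in reference for m in mentions) / len(mentions)
            for t, mentions in entities
        ]
        return sum(per_entity) / len(entities)

    return f1(coverage(pred, g), coverage(gold, p))


# Toy document: one entity with eight mentions and two entities with one mention each.
gold = [("ORG", set(range(8))), ("PER", {8}), ("LOC", {9})]
# The model types the frequent entity correctly but swaps the two rare ones.
pred = [("ORG", set(range(8))), ("LOC", {8}), ("PER", {9})]

print(f"mention-level F1:   {mention_level_f1(gold, pred):.3f}")
print(f"entity-weighted F1: {entity_weighted_f1(gold, pred):.3f}")
```

On this toy input the mention-level score stays high because eight of the ten mentions belong to the one entity the model got right, while the entity-weighted score drops because two of the three entities are mistyped.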
Similar resources
An Entity-Level Approach to Information Extraction
We present a generative model of template-filling in which coreference resolution and role assignment are jointly determined. Underlying template roles first generate abstract entities, which in turn generate concrete textual mentions. On the standard corporate acquisitions dataset, joint resolution in our entity-level model reduces error over a mention-level discriminative approach by up to 20%.
Multi-document Summarization for Terrorism Information Extraction
Counterterrorism is one of the major challenges facing society. To fight terrorists, it is very important to have a thorough understanding of terrorism incidents. However, it is impossible for a human to read all the information related to a terrorism incident because of the large volume of information. Summarization techniques are urgently required for analysis of terroris...
Entity Centric Information Retrieval
In the past decade, the prosperity of the World Wide Web has led to a rapid explosion of information, and there is a long-standing demand for ways to access such a huge volume of information effectively and efficiently. Information Retrieval (IR) aims to tackle this challenge by exploring approaches to obtain information items (e.g., documents) relevant to a given information need (e.g., que...
Information Extraction from Multi-Document Threads
Information extraction (IE) is the task of extracting fragments of important information from natural language documents. Most IE research involves algorithms for learning to exploit regularities inherent in the textual information and language use, and such systems generally assume that each document can be processed in isolation. We are extending IE techniques to multi-document extraction tas...
Multi-level Boundary Classification for Information Extraction
We investigate the application of classification techniques to the problem of information extraction (IE). In particular we use support vector machines and several different feature-sets to build a set of classifiers for IE. We show that this approach is competitive with current state-of-the-art IE algorithms based on specialized learning algorithms. We also introduce a new technique for improv...
Journal
Journal title: Information Processing and Management
Year: 2021
ISSN: 0306-4573, 1873-5371
DOI: https://doi.org/10.1016/j.ipm.2021.102563